tidyverse - 101

For this course, you will need to download R and RStudio. One way to get R is to go to CRAN (the Comprehensive R Archive Network).
RStudio is the IDE (integrated development environment) we’ll be using. You can download it from http://www.rstudio.com/download.
We will also use the patchwork package to easily arrange many plots into one single figure.
“The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.”
These packages help you import, tidy and understand data, so that you can finally communicate your findings in an easy way.
You can install the core packages from the tidyverse by simply typing install.packages("tidyverse") in the console.
The core tidyverse packages (ggplot2, tibble, tidyr, readr, dplyr, stringr, forcats, purrr) can then be loaded with the library() function.
There are many other packages that comply with the tidyverse data structure and practices that have to be installed and loaded independently. Some of those that we’ll be using throughout the course are tidymodels and a tidy ecosystem called tidyverts, specifically designed to provide tidy tools for time series analysis, which comprises of three main packages: tsibble, feasts and fable.
Registered S3 methods overwritten by 'dbplyr':
  method         from
  print.tbl_lazy dplyr
  print.tbl_sql  dplyr
-- Attaching packages --------------------------------------- tidyverse 1.3.0 --
v ggplot2 3.3.0 v purrr 0.3.4
v tibble 3.0.1 v dplyr 0.8.5
v tidyr 1.0.3 v stringr 1.4.0
v readr 1.3.1 v forcats 0.5.0
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
Attaching package: 'lubridate'

The following objects are masked from 'package:dplyr':

    intersect, setdiff, union

The following objects are masked from 'package:base':

    date, intersect, setdiff, union

Attaching package: 'tsibble'

The following object is masked from 'package:lubridate':

    interval

The following object is masked from 'package:dplyr':

    id
Loading required package: fabletools
So, how does each of these packages help us tackle our data science problems?
We can say that a data science workflow (aimed at forecasting) comprises the following general steps:
Data can be imported into R in many ways. The easiest is importing files, such as .csv, .txt or MS Excel files. However, you can also import files from other statistical software or even connect RStudio to a database, such as an SQL database.
There are several tidyverse-friendly packages available to import the specific files on your device. Usually, the object the data gets imported into is a tibble.
Tibbles are the modern take on the data.frame, with many advantages over their predecessor: they have a convenient print() method that doesn't fill up your console, and they don't change variable types or names, among other things. You can learn more about tibbles here.
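As a quick illustration of those advantages, here is a minimal sketch using the tibble package (the data is made up):

```r
library(tibble)

# Unlike data.frame(), tibble() builds columns sequentially,
# so later columns can refer to earlier ones
df <- tibble(
  x = 1:5,
  y = x^2   # refers to the x column defined just above
)

# Printing shows dimensions and column types, and never
# floods the console, even for very large tables
df
```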
* The readr package has many functions, like read_csv() or read_delim(), which allow us to import .csv, .txt, tab-separated values, etc.
* The readxl package lets you import data from MS Excel files.
* The haven package lets you load data from other statistical packages, such as SPSS, Stata and SAS.
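A minimal import sketch with these packages might look like this (the file names below are hypothetical placeholders for files on your device):

```r
library(readr)
library(readxl)

# Comma-separated file; read_csv() returns a tibble and guesses column types
sales <- read_csv("sales.csv")

# Text file with a custom delimiter (here, semicolons)
sales_eu <- read_delim("sales_eu.txt", delim = ";")

# First sheet of an Excel workbook
budget <- read_excel("budget.xlsx", sheet = 1)
```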
As said before, you can also connect your RStudio directly to a database. It's straightforward to do so in RStudio's IDE; just follow these simple steps:
1. Go to the Connections tab.
2. Select the data source.
3. Connect to the SQL database.
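You can also open the connection from code with the DBI package. A minimal sketch, assuming the RSQLite driver is installed (an in-memory SQLite database stands in for a real server; swap in your own driver and credentials):

```r
library(DBI)

# Connect to an in-memory SQLite database (a placeholder for a real DB)
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Copy a built-in data set into the database so we have something to query
dbWriteTable(con, "mtcars", mtcars)

# Run plain SQL; dplyr can also translate its verbs to SQL through dbplyr
res <- dbGetQuery(con, "SELECT cyl, COUNT(*) AS n FROM mtcars GROUP BY cyl")
res

dbDisconnect(con)
```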
All the tidyverse packages rely on having tidy data to work with. But how can we know if our data is truly tidy? As pointed out here, we need to know what values, variables and observations are, and have them arranged the right way.
Values can be quantitative or qualitative.
A variable is a collection of values that measure the same attribute (sales, price, temperature, time).
An observation contains all the values measured from all variables in a single unit (a country, a day, a person, a company).
So, now, to consider our data as tidy, we must ensure that:
Every column is one variable (and only one).
Every row is just one observation.
Every cell is a single value.
The tidyr package can help us achieve this by:
reshaping data by pivoting (pivot_longer() or pivot_wider()),
rectangling, which turns nested data (such as JSON files) into tidy tibbles (unnest_longer() or unnest_wider()),
filling missing values, by replacing them (replace_na()), dropping such observations (drop_na()), or filling with the previous or next values (fill()),
splitting and combining character columns (separate() and unite()), among many other things.
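For instance, pivoting a wide table into tidy (long) form might look like this (the store/month data is made up for illustration):

```r
library(tidyr)
library(tibble)

# One row per store, one column per month -- not tidy:
# the month variable is spread across column names
wide <- tibble(
  store = c("A", "B"),
  jan   = c(100, 150),
  feb   = c(110, 160)
)

# Every column except `store` becomes a (month, sales) pair
long <- pivot_longer(wide, -store, names_to = "month", values_to = "sales")
long
```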
The tsibble package is the foundation of the tidyverts. It provides a data infrastructure (a tsibble), which is a data- and model-oriented object, where you have:
An index, which is the temporal variable that sets the ordering mechanism.
A key, that is a set of variables that define observational units over time.
This way, every observation is uniquely identified by index and key. Also, if the time series is regularly spaced, every observational unit should be measured with a common interval.
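As a sketch of how index and key work (with made-up monthly sales for two hypothetical stores):

```r
library(tsibble)
library(tibble)

# Two stores ("A" and "B"), each measured monthly
sales <- tibble(
  month = yearmonth(rep(c("2020 Jan", "2020 Feb"), 2)),
  store = rep(c("A", "B"), each = 2),
  sales = c(100, 110, 150, 160)
)

# index = month sets the temporal ordering;
# key = store identifies each observational unit over time
ts <- as_tsibble(sales, index = month, key = store)
ts
```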
In order to properly understand our data, we might need to transform it, visualize it, and then model it.
The dplyr package contains functions specifically designed to help you transform tidy data. You can add new variables (mutate()), choose specific variables (select()), pick observations by their values (filter()) or by their position or index in the table (slice()), and sort observations (arrange()).
You can also group your data (group sales by store: group_by(data, store)).
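Putting these verbs together in a pipeline, using the built-in mtcars data set:

```r
library(dplyr)

res <- mtcars %>%
  filter(cyl != 6) %>%            # pick observations by value
  mutate(kpl = mpg * 0.425) %>%   # add a new variable (km per litre)
  group_by(cyl) %>%               # group by number of cylinders
  summarise(mean_kpl = mean(kpl)) %>%
  arrange(desc(mean_kpl))         # sort the summary
res
```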
Whenever you want to manipulate character variables, the stringr package is a great way to deal with the matter. Conveniently, all the stringr functions start with str_. Some of the things you can do with it are:
Detect matches, get the index where the pattern is found, count occurrences or locate the position of the pattern within the string (str_detect(), str_which(), str_count(), str_locate()).
Subset strings, by extracting substrings from vectors (str_sub()), keeping only the strings that match a pattern (str_subset()), or extracting the matched pattern itself (str_extract())…
Manage lengths (str_length(), str_pad(), str_trunc(), str_trim()),
Mutate, join and split strings (str_replace(), str_to_lower(), str_c()).
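A few of these functions in action, on a small made-up character vector:

```r
library(stringr)

fruits <- c("apple", "banana", "pear")

str_detect(fruits, "an")        # which strings contain "an"?
str_count(fruits, "a")          # how many "a"s in each string
str_sub(fruits, 1, 3)           # first three characters of each
str_c(fruits, collapse = ", ")  # combine into one string
str_to_upper(fruits)            # change case
```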
Categorical variables, or factors, as they're called in R, can be manipulated with the forcats package.
You can have either ordered factors, or unordered factors, and some functions that would help you handle them would be:
fct_reorder(), using another variable to specify the new order.
fct_infreq(), to order it according to the frequency of values.
fct_relevel(), to manually choose the order.
fct_lump(), to collapse the least frequent values of a factor into an “other” category.
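For example, on a small made-up factor:

```r
library(forcats)

f <- factor(c("dog", "cat", "dog", "bird", "dog", "cat"))

fct_infreq(f)           # reorder levels by frequency: dog, cat, bird
fct_relevel(f, "bird")  # manually move "bird" to the front
fct_lump(f, n = 1)      # keep the most frequent level, lump the rest as "Other"
```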
Handling date-time variables in R can be challenging with base R. Fortunately, the lubridate package is here for you.
For example, if you import an Excel sheet with a date variable on it, it will be parsed as character. You need to convert it to a date (or date-time) variable, in order to make calculations, plots, or anything relevant with it.
There are some very intuitive parsing functions to help you out, such as ymd() for dates stored with the order year month date (it can handle many formats such as “YYYY-MM-DD”, “YYYY/MM/DD”, “YYYY MM DD”, etc.), dmy() for cases when the day comes first, followed by month and year (“DD-MM-YYYY”). Date-time objects can also be parsed with ymd_hms(), ymd_h(), dmy_hm(), hms() just to mention a few.
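The parsers mentioned above in action:

```r
library(lubridate)

ymd("2020-05-17")               # year-month-day, dash-separated
ymd("2020/05/17")               # same date, different input format
dmy("17-05-2020")               # day comes first
ymd_hms("2020-05-17 10:30:00")  # a date-time

# Once parsed, date arithmetic and component extraction are easy
d <- ymd("2020-05-17")
d + days(30)
month(d, label = TRUE)
```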
Proceed to Visualization
Base R plots and graphs are very basic (sometimes even ugly):
Luckily, the ggplot2 package can produce astonishing plots and figures conveniently. It relies on the “Grammar of Graphics”, where you:
provide the data,
specify how you want to map the variables to aesthetics,
what type of plot or graph you want to produce
any other customization you’d like,
and ggplot2 takes care of it.
So, in general terms, a ggplot2 plot is built as follows:
Start with a ggplot() object, where you specify the data to be used,
supply the aesthetic mapping (with aes()),
add on layers:
geometric elements, such as points (geom_point()) or histograms (geom_histogram()); other common geoms are geom_line(), geom_bar() and geom_boxplot();
scales, such as scale_color_brewer() or scale_color_distiller();
facets, with facet_wrap() or facet_grid();
coordinate systems, such as coord_cartesian() or coord_flip().
Every element is separated with a plus sign (+).
It’s important to mention that the aesthetics can be passed inside the ggplot() function, or within a graphic primitive. In the former, the aesthetics are the same for all the layers, whereas in the latter, the aesthetics passed to a specific layer only affect that layer.
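The difference between the two placements can be seen with the built-in mpg data set:

```r
library(ggplot2)

# Global mapping: every layer inherits x, y and colour,
# so there is one smooth line per class
ggplot(mpg, aes(x = displ, y = hwy, colour = class)) +
  geom_point() +
  geom_smooth(se = FALSE)

# Layer-level mapping: only the points are coloured,
# so there is a single overall smooth line
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(colour = class)) +
  geom_smooth(se = FALSE)
```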
We could not cover all the different variants that can be achieved with ggplot2 here, even if we tried.
Continue to Exploratory Data Analysis
Visually inspecting your data can give you insight into its dynamics, patterns, and historical behavior. However, before going into the modelling phase of the analysis, we must perform exploratory data analysis (EDA).
EDA involves making hypotheses about your data, transforming it, and visually inspecting its statistical properties.
Some of the most important things to note on the EDA process are:
What type of variation does each variable have?
What is the covariation between variables?
Are there outliers present in the data?
What type of distribution do the variables follow?
Some examples
For categorical variables (factors), we can use a bar chart:
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut)) +
  ggtitle("Count of Diamonds by cut quality")

For continuous variables, a histogram can be used:
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.5) +
  ggtitle("Histogram of carats")

If you want to analyze the histogram of multiple variables in one single plot, you can use geom_freqpoly() or use faceting:
ggplot(data = diamonds %>% filter(carat < 3), mapping = aes(x = carat, colour = cut)) +
  geom_freqpoly(binwidth = 0.1)

ggplot(data = diamonds %>% filter(carat < 3), mapping = aes(x = carat, fill = cut)) +
  geom_histogram(binwidth = 0.1) +
  facet_wrap(~ cut) +
  theme(legend.position = "none")

Choosing the binwidth of the histogram can tell different stories or reveal different patterns:
g <- ggplot(data = diamonds %>% filter(carat < 3), mapping = aes(x = carat))
g0 <- g + geom_histogram(binwidth = 0.5) +
ggtitle("Binwidth = 0.5")
g1 <- g + geom_histogram(binwidth = 0.1) +
ggtitle("Binwidth = 0.1")
g2 <- g + geom_histogram(binwidth = 0.01) +
ggtitle("Binwidth = 0.01")
g3 <- g0/g1/g2
g3 + plot_annotation(title = "Histograms varying the binwidth",
                     subtitle = "Different patterns can arise when selecting different binwidths")

Another option is to go with boxplots:
ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
  labs(x = "class", y = "hwy mpg")

Again, these are just some examples, but the list goes on and on.
Regarding time series analysis, the feasts package (Feature Extraction and Statistics for Time Series) has many functions that make it easy to get further insight into our time series. We can get many graphs from it, such as:
We can also get statistics from this package:
Go to Model
Whenever we try to model a variable or phenomenon, it is said that we are trying to get a simplified version of reality. In fact, it's simpler than that: what we are really trying to do is understand the way a variable or set of variables changes, while ignoring or eliminating external “noise”.
It is well known to most data scientists that you cannot use the same data for modelling and testing (or forecasting). That's why many people split their data into a training and a testing set. However, a true data scientist has to be even stricter in their use of data:
It is recommended that you split your data into three sets:
1. ~ 60% of your data goes to the training set. Here, you can visualize it, perform all the model fitting and tweaking you want, over and over again.
2. ~ 20% of your data should go to a query set. With this query set, you can compare models by hand and visualize the outcomes.
3. The remaining ~ 20% of the data forms the test set. Once you've compared all your models with the training and/or query sets, you can test your final model. This test can only be performed ONCE. This ensures no bias is introduced in the model and it remains a true forecast.
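A simple random 60/20/20 split could be sketched like this in base R (using the built-in mtcars data as a stand-in; for time series you would instead split chronologically, keeping the most recent observations for the test set):

```r
set.seed(42)

n   <- nrow(mtcars)
idx <- sample(seq_len(n))   # shuffle the row indices once

# Cut the shuffled indices into three consecutive chunks
train <- mtcars[idx[1:floor(0.6 * n)], ]                     # ~60%: fit and tweak
query <- mtcars[idx[(floor(0.6 * n) + 1):floor(0.8 * n)], ]  # ~20%: compare models
test  <- mtcars[idx[(floor(0.8 * n) + 1):n], ]               # ~20%: used ONCE
```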
We will go through many different models aimed to provide forecasts for different situations.
The family of models we will be studying throughout the course are the following:
There are many ways to train models in R. As we said before, we will be using primarily two packages for this: tidymodels and, from the tidyverts, the fable package, which has many univariate and multivariate time series forecasting models. The way to fit one or more models in the fable package is by specifying them in the model() function:

fit <- aus_retail %>%
filter(
State %in% c("New South Wales", "Victoria"),
Industry == "Department stores"
) %>%
model(
ets = ETS(box_cox(Turnover, 0.3)),
arima = ARIMA(log(Turnover)),
snaive = SNAIVE(Turnover)
)
This produces a mable (a model table), where every cell is a fitted model.
Using the report() function, you can get a detailed view of a particular model.
Series: Turnover
Model: ARIMA(2,1,1)(1,1,2)[12]
Transformation: log(.x)
Coefficients:
ar1 ar2 ma1 sar1 sma1 sma2
-0.2700 -0.1007 -0.7738 0.2676 -0.8690 0.0254
s.e. 0.0612 0.0588 0.0410 0.2378 0.2423 0.1818
sigma^2 estimated as 0.002547: log likelihood=668.31
AIC=-1322.62 AICc=-1322.36 BIC=-1294.21
You could use the tidy() function to retrieve each model’s terms, estimates and statistics in a tidy way.
fit %>%
  filter(State == "Victoria") %>%
  select(State, Industry, arima) %>%
  gg_tsresiduals() +
  ggtitle("Residual Diagnostics for the ARIMA model fitted to Victoria")

With the glance() function we can get further information for every model fitted.
Go to Forecasting
Once we have trained one or more models that have a proper fit to our training data, we can now go ahead and produce forecasts. Using fable is as simple as running the forecast() function to our mable. It’s also quite easy to get plots from it:
fcst %>%
  autoplot(filter(aus_retail, year(Month) > 2010), level = NULL) +
  ggtitle("Forecast comparison across models and time series")

fcst %>%
  filter(.model == "ets") %>%
  autoplot(filter(aus_retail, year(Month) > 2010)) +
  ggtitle("Forecast (with prediction intervals) for the ETS model")

In practice, you could automate the complete forecasting workflow by defining functions and iterating with purrr.
data("world_bank_pop")
# A tibble: 1,056 x 20
# country indicator `2000` `2001` `2002` `2003` `2004` `2005` `2006` `2007`
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 ABW SP.URB.T~ 4.24e4 4.30e4 4.37e4 4.42e4 4.47e+4 4.49e+4 4.49e+4 4.47e+4
# 2 ABW SP.URB.G~ 1.18e0 1.41e0 1.43e0 1.31e0 9.51e-1 4.91e-1 -1.78e-2 -4.35e-1
# 3 ABW SP.POP.T~ 9.09e4 9.29e4 9.50e4 9.70e4 9.87e+4 1.00e+5 1.01e+5 1.01e+5
# 4 ABW SP.POP.G~ 2.06e0 2.23e0 2.23e0 2.11e0 1.76e+0 1.30e+0 7.98e-1 3.84e-1
# 5 AFG SP.URB.T~ 4.44e6 4.65e6 4.89e6 5.16e6 5.43e+6 5.69e+6 5.93e+6 6.15e+6
# 6 AFG SP.URB.G~ 3.91e0 4.66e0 5.13e0 5.23e0 5.12e+0 4.77e+0 4.12e+0 3.65e+0
# 7 AFG SP.POP.T~ 2.01e7 2.10e7 2.20e7 2.31e7 2.41e+7 2.51e+7 2.59e+7 2.66e+7
# 8 AFG SP.POP.G~ 3.49e0 4.25e0 4.72e0 4.82e0 4.47e+0 3.87e+0 3.23e+0 2.76e+0
# 9 AGO SP.URB.T~ 8.23e6 8.71e6 9.22e6 9.77e6 1.03e+7 1.09e+7 1.15e+7 1.21e+7
# 10 AGO SP.URB.G~ 5.44e0 5.59e0 5.70e0 5.76e0 5.75e+0 5.69e+0 4.92e+0 4.89e+0
# ... with 1,046 more rows, and 10 more variables: `2008` <dbl>, `2009` <dbl>,
# `2010` <dbl>, `2011` <dbl>, `2012` <dbl>, `2013` <dbl>, `2014` <dbl>,
#   `2015` <dbl>, `2016` <dbl>, `2017` <dbl>